
Conversation

@lkhphuc (Contributor) commented on Aug 21, 2025:

First PR to onboard modern VLM training to torchtitan.

Features:

  • Native Aspect Ratio: not limited to square crops.
  • Native Resolution: images in a batch can have different sizes; no more image tiles and thumbnails.
  • Native Interleaved Data: training samples can contain a variable number of images, interleaved with text at arbitrary positions. You can train more than just a captioning model.

Design

Distributed training usually does not play nicely with inputs of varying shapes. To handle a varying number of images and image sizes, we require two additional hyperparameters, the number of images per batch N and the maximum image patch sequence length L, and then pad the actual image patches to this fixed size.

  • After tok_embedding, we obtain text tokens of shape BxS.
  • After the encoder, we obtain visual tokens of shape NxL.
  • We extract only the valid visual tokens,
  • then scatter those tokens to their actual positions in the LLM input tokens (see the sketch below).
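
For concreteness, here is a minimal sketch of the extract-and-scatter step above. All names (h_text, h_vis, valid_vis, image_token_id) are hypothetical and only assume the shapes described in this PR, not its actual API:

```python
# Illustrative sketch only; assumes the B x S text tokens contain one <|image|>
# placeholder per valid image patch, as the dataloader guarantees by construction.
import torch

def scatter_visual_tokens(
    h_text: torch.Tensor,     # [B, S, D] text token embeddings after tok_embedding
    h_vis: torch.Tensor,      # [N, L, D] visual tokens after the encoder (padded)
    valid_vis: torch.Tensor,  # [N, L] bool mask, False for padded image patches
    input_ids: torch.Tensor,  # [B, S] token ids
    image_token_id: int,      # id of the image placeholder token
) -> torch.Tensor:
    # Keep only the real visual tokens.
    vis_flat = h_vis[valid_vis]             # [num_valid, D]
    # Positions in the LLM input reserved for image patches.
    slots = input_ids == image_token_id     # [B, S]
    # Scatter visual tokens into those positions.
    out = h_text.clone()
    out[slots] = vis_flat.to(out.dtype)
    return out
```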

This requires the dataloader to handle the following aspects:

  • Interleave the precise number of image tokens into the input tokens, based on the encoder's patch size and the input images' sizes.
  • Convert images/videos to a 1D sequence of patches:
    • rearrange(pixels, 'n (t pt) (h ph) (w pw) c -> n (t h w) (pt ph pw c)', pt=temporal_ps, ph=patch_size, pw=patch_size)
    • Pad all image patch sequences to a fixed length and return pixel_values.shape == [N, L, D].
  • Return grid_thw.shape == [N, L, 3] to keep track of the location indices of each patch in its image. Padding patches can be tracked in the same tensor with the value -1. (A minimal patchify-and-pad sketch follows this list.)
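
For illustration, a minimal patchify-and-pad sketch following the layout described above. The helper name, the temporal_ps/patch_size argument names, and the assumption that the number of patches fits within the fixed length are mine, not the PR's code:

```python
# Sketch: patchify [n, T, H, W, C] pixels, then pad to fixed [N, L, D] with
# grid coordinates [N, L, 3]; padded positions get grid value -1.
# Truncation of over-long sequences is omitted for brevity.
import torch
from einops import rearrange

def patchify_and_pad(pixels: torch.Tensor, patch_size: int, temporal_ps: int, max_len: int):
    patches = rearrange(
        pixels,
        "n (t pt) (h ph) (w pw) c -> n (t h w) (pt ph pw c)",
        pt=temporal_ps, ph=patch_size, pw=patch_size,
    )                                                   # [n, num_patches, D]
    n, num_patches, dim = patches.shape
    # (t, h, w) coordinates for each patch, matching the (t h w) flattening order above.
    t = pixels.shape[1] // temporal_ps
    h = pixels.shape[2] // patch_size
    w = pixels.shape[3] // patch_size
    coords = torch.stack(
        torch.meshgrid(torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij"),
        dim=-1,
    ).reshape(1, -1, 3).expand(n, -1, -1)               # [n, num_patches, 3]
    # Pad the patch dimension to the fixed length L.
    pixel_values = torch.zeros(n, max_len, dim, dtype=patches.dtype)
    grid_thw = torch.full((n, max_len, 3), -1, dtype=torch.long)
    pixel_values[:, :num_patches] = patches
    grid_thw[:, :num_patches] = coords
    return pixel_values, grid_thw
```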

This results in a very simple and general interface to train modern VLMs with interleaved data and native resolution & aspect ratio:

  • Depending on the data mixture, we can set the dataloader's hyperparameters N and L to have minimal empty image padding (in the batch dimension).
  • Use modern PyTorch features (Flex Attention, compile, etc.) for efficient handling of a different attention mask per image (padding in the sequence dimension); see the sketch after this list.
  • Interfaces nicely with TP, PP, etc.
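
As a rough sketch of the second point, one way to express the per-image padding mask with FlexAttention (PyTorch >= 2.5). Function and variable names here are illustrative assumptions, not the PR's implementation:

```python
# Sketch: restrict attention in the vision encoder to valid (non-padding) patches,
# using the grid_thw convention above where padded patches are marked with -1.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def encoder_attention(q, k, v, grid_thw):
    # q, k, v: [N, num_heads, L, head_dim]; grid_thw: [N, L, 3]
    valid = grid_thw[..., 0] >= 0              # [N, L] True for real patches

    def mask_mod(b, h, q_idx, kv_idx):
        # Attend only among valid patches of the same image b.
        return valid[b, q_idx] & valid[b, kv_idx]

    n, _, L, _ = q.shape
    block_mask = create_block_mask(mask_mod, B=n, H=None, Q_LEN=L, KV_LEN=L, device=q.device)
    return flex_attention(q, k, v, block_mask=block_mask)
```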

In this PR

  • Minimal interleaved Obelics dataloader with native resolution and aspect ratio.
    • The dataloader is currently very slow, as it needs to download images from the internet every time you run. (The same is true for the current implementation in the multimodal experiment.)
  • Siglip2 model code, mostly based on HF.
  • VLM model code, called Llama3Siglip2, connecting the vision encoder and language decoder.
  • Minimal infra code for the debug model to run.

Todo:

  • Add support for captioning HF datasets that store images inside the dataset (CC12M, like the Flux experiment?) so loading is not super slow.
  • Flex Attention for the encoder.
  • Modify the Llama3 tokenizer to add special tokens.
  • Script to combine Siglip2 + Llama3 weights and load them.
  • Test Siglip2 encoder correctness.
  • Multimodal CE loss to correct for image token bias.
  • All the parallelisms: DP, CP, TP, PP.

@wwwjn (Contributor) left a comment:

Thanks for making this great PR! I learned a lot from it personally. However, I feel the data preprocessing part in mm_collator_nld.py is a little hard to follow and read.

Image preprocessing mainly happens in mm_collator_nld.py, and the collator functions contain the following steps for images:

  1. Patchify
  2. Generate grids with coordinates
  3. Pad / truncate
  4. Assemble into batched outputs

Text preprocessing is mainly handled in mm_dataset.py, which also contains several steps, e.g. padding with <image> tokens, tokenization, and masking out <image> tokens in the labels.

I was wondering whether we can further split the image and text preprocessing functions into smaller code pieces, add tensor shape hints, or even add examples like experiments/multimodal. This would improve readability and make debugging easier.

The VLM modeling parts LGTM, they're clearly structured!

Comment on lines 242 to 245
# Normalize with OpenAI CLIP mean/std
mean = np.array([0.48145466, 0.4578275, 0.40821073])
std = np.array([0.26862954, 0.26130258, 0.27577711])
img_array = (img_array - mean) / std
Contributor:

n00b question: why do we use the CLIP mean/std to normalize the dataset? Is it a common practice?

@lkhphuc (author):

This depends on the pretrained vision encoder. We hardcoded it here because Siglip2 is trained with this normalization.
Ideally it should be part of the model_args because it depends on the model, and we would access it here. However, the current API does not expose model_args to the build_dataloader function. Should we expose it?

Contributor:

Previously we wanted to separate the dataloader and the model because they are unrelated, with the tokenizer as the bridge between the two for text inputs. From the diagram and the code, it seems that for image inputs the VisionTransformer + projector works equivalently as the bridge between the Llama model and the dataloader. As we only support the Siglip2 encoder now, we can hardcode it, and I think we shouldn't expose model_args to build_dataloader for now.

btw, do we need to load pre-trained VisionTransformer weights during pre-training? Or is it trained together with the main Llama model during pre-training?

@lkhphuc (author):

Yes, we will load both pretrained Siglip2 and Llama3. "VLM pretraining" usually refers to starting from a separately pretrained vision encoder and text decoder; only the projector connecting them is randomly initialized.
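
For illustration only, a tiny sketch of the setup described here: pretrained encoder and decoder with a randomly initialized projector in between. The class name and dimensions are assumptions, not the PR's actual modules:

```python
# Only this bridge module starts from random weights; the vision encoder (Siglip2)
# and the language model (Llama3) load their pretrained checkpoints.
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):  # [N, L, vision_dim] -> [N, L, llm_dim]
        return self.proj(visual_tokens)
```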



## Design
Distributed training usually does not play nicely with inputs of varying shapes. To handle a varying number of images and image sizes, we require two hyperparameters, image batch size `N` and image length `L` (in patches), and pad the actual image patches to this fixed size.
Contributor:

I think including the diagram in the PR description would make this part much clearer! We should figure out a way to include images in the README.

dp_world_size: int = 1,
infinite: bool = False,
patch_size: int = 16,
merge_size: int = 1,
Contributor:

Can you briefly explain what merge_size does during image processing? I find a lot of functions use merge_size, but it is always set to 1. Is there any case where merge_size is not 1?

This folder showcases how to train a modern Vision Language Model (VLM) in torchtitan.


## Features:
Contributor:

This part mainly describes dataloader features. Can we separate the README into 2 parts: 1) model features (which model/encoder, what is supported now, e.g. FSDP, AC, compile, and TODOs); 2) dataloader features?

return module


def apply_ac(model: nn.Module, ac_config):
Contributor:

You can reuse the apply_ac() function; it is a common building block under torchtitan.distributed.activation_checkpoint.

@lkhphuc (author):

Sorry, I don't see it under torchtitan.distributed.activation_checkpoint?

Contributor:

Sorry, my bad; the change is still in a PR: https://github.com/pytorch/torchtitan/pull/1645/files. For now we import apply_ac via from torchtitan.models.llama3.infra.parallelize import apply_ac.

logger.info(f"Applied {ac_config.mode} activation checkpointing to the model {type(model).__name__}")


def apply_compile(model: nn.Module):
Contributor:

Same for apply_compile and apply_ddp; we could reuse these parts.

@lkhphuc (author) left a comment:

Thanks for your detailed review, Jiani. We have substantially refactored the dataloader part per your suggestions, and made some modifications to make it more robust.

Now we can run both the interleaved Obelics dataset and the CC12M captioning dataset. The CC12M dataset from HF has the images included in the dataset itself, so it should run substantially faster once downloaded.
We still include both here to demonstrate the generality of the approach: we can handle both interleaved and captioning data with no code or model change.

We also added a Sample Packing feature to the dataloader. This is analogous to "document packing" in Llama3 training, and only packs the LLM samples ("<|image|> text text... <|eos|> text <|image|>"). The vision encoder still operates on images of shape NxLxD, where each n in N is a single image.
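
For illustration, a rough sketch of the sample-packing idea on the text side (not the PR's implementation): tokenized samples, already containing image placeholder tokens, are concatenated into fixed-length LLM sequences, while image patches stay in their separate NxLxD batch.

```python
# Sketch: pack tokenized samples into fixed-length sequences, splitting a sample
# across sequence boundaries like Llama3-style document packing.
from typing import Iterable, Iterator

def pack_samples(samples: Iterable[list[int]], seq_len: int, pad_id: int) -> Iterator[list[int]]:
    buffer: list[int] = []
    for tokens in samples:          # each sample already contains <|image|> placeholders
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]  # one packed LLM sample
            buffer = buffer[seq_len:]
    if buffer:                      # pad the final partial sequence
        yield buffer + [pad_id] * (seq_len - len(buffer))
```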


